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(57) Abstract 



A barrier mechanism provides a low-latency method of synchronizing all or some of the processing elements (PEs) in a massively 
parallel processing system. The barrier mechanism is supported by one or more physical barrier svnchronizauoo circuits each receiving 
an input from every PE in the processing system. Each PE has an associated barrier synchronization register, in which each bit is used 
as an input to one of a plurality of logical barrier synchronisation circuits. The hardware supports both a conventional barrier function 
and an alternative eureka function. Each bit in the barrier synchronization registers can be programmed to perform as either barrier or 
eureka function, and all bits of the registers and each barrier synchronization circuits functions independently. Partitioning among PEs is 
accomplished by a barrier mask and interrupt register which enable certain of the bits in the barrier synchronization registers to a denned 
groups of PEs. Further partitioning is accomplished by providing bypass points in the physical barrier synchronization circuits to subdivide 
the physical barrier synchronization circuits into several types of PE barrier partitions of varying size and shape. The barrier mask and 
interrupt register and the bypass points are used in concert to accomplish flexible and scalable partitions corresponding to user-desired sizes 
and shapes with a latency several orders of magnitude faster than existing software implementations. 
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BARRIER SYNCHRONIZATION FOR DISTRIBUTED 
MEMORY MASSIVELY PARALLEL PROCESSING SYSTEMS 

5 

Field of t he Invention 

The present invention relates generally to massively parallel processing 
systems, and more particularly to a barrier synchronization mechanism for facilitating 
synchronization between multiple processors in such a system. 

10 

Bflfikgnmnd of the Invention 
Massively parallel processing (MPP) systems are computing systems 
comprised of hundreds or thousands of processing elements (PEs). While the power 
of a multiple instruction-multiple data (MIMD) computer system lies in its ability to 
15 execute independent threads of code simultaneously, the inherently asynchronous 
states of the PEs (with respect to each other) makes it difficult in such a system to 
enforce a deterministic order of events when necessary. Program sequences 
involving interaction between multiple PEs such as coordinated communication, 
sequential access to shared resources, controlled transitions between parallel regions, 
20 etc., may require synchronization of the PEs in order to assure proper execution. 

An important synchronization capability in any programming model is 
the barrier. Barriers are points placed in the code beyond which no processor 
participating in a computation may proceed before all processors have arrived. Since 
processors wait at a barrier until alerted that all PEs have arrived, the latency of the 
25 barrier mechanism can be very important. The latency of a barrier mechanism is the 
propagation time between when the last processor arrives at a barrier, and when all 
processors have been notified that the barrier has been satisfied. During this period 
of time, all PEs may be idle waiting for the barrier to be satisfied. Hence barriers 
are a serialization point in a parallel code. 
30 Barriers can be implemented entirely by software means, but software 

schemes are typically encumbered by long latencies and/or limited parallelism 
restricting how many processors can arrive at the barrier simultaneously without 
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artificial serialization. Because a barrier defines a serialization point in a program, it 
is important to keep the latency as low as possible. 

Hardware support for barriers, while addressing the latency problems 
associated with barriers implemented by purely software means, can have other 
5 shortcomings that limit the utility of the mechanism in a production computing 
system. Production computing systems demand that the barrier resource (like all 
resources) be partitionable among multiple users while maintaining protective 
boundaries between users. In addition, the barrier resource must be rich enough to 
allow division between the operating system and the user executing within the same 
10 partition. Provision must also be made for fault tolerance to insure the robust nature 
of the barrier mechanism. 

Hardware mechanisms may also suffer from an inability to operate 
synchronously. This inability may require that a PE, upon discovering that a barrier 
has been satisfied (all PEs have arrived at that point in the program), wait until all 
15 PEs have discovered that the barrier has been reached before it may proceed through 
the next section of program code. The ability to operate synchronously enables the 
banier mechanism to be immediately reusable without fear of race conditions. 

Hardware mechanisms may also require that a PE explicitly test a 
barrier flag to discover when the barrier condition has been satisfied. This can 
20 prevent a PE from accomplishing other useful work while the barrier remains 

unsatisfied, or force the programmer to include periodic tests of the barrier into the 
program in order to accomplish other useful work while a barrier is pending. This 
can limit the usefulness of a barrier mechanism when used as a means of terminating 
speculative parallel work (e.g., a database search) when the work has been completed 
25 (e.g. the searched-for item has been found). 

tSlimiW™ ftf thg Invention 

To overcome the above described shortcomings in the art and provide 
key system resources necessary for production computing, the present invention 
30 provides a hardware mechanism that facilitates barrier synchronization in a massively 
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parallel processing system. The present barrier mechanism provides a partitionable, 
low-latency, immediately reusable, robust mechanism which can be used to alert all 
PEs in a partition when all of the PEs in that partition have reached a designated 
point in the program. The mechanism permits explicit testing of barrier satisfaction, 

5 or can alternately interrupt the PEs when a barrier has been satisfied. The present 
invention also provides an alternate barrier mode, called a eureka, that satisfies a 
barrier when any one PE has reached a designated point in the program, providing 
the capability to terminate speculative parallel work. The present barrier mechanism 
provides multiple barrier resources to a partition to allow pipelining of barriers to 

10 hide the latency as well as offering raw latencies 2-3 orders of magnitude faster than 
software implementations. The barrier mechanism is also partitionable for 
multi-users. Barriers are used to bulk synchronize the processors in a partition 
between loops where producer/consumer hazards may exist, or control entry and exit 
between segments of a program. 

15 

Brief Description nf the Drawings 
The foregoing and other objects, features and advantages of the 
invention, as well as the presently preferred embodiments thereof, will become 
apparent upon reading and understanding the following detailed description and 
20 accompanying drawings in which: 

FIGURE 1 shows a simplified block diagram of a representative MPP 
system with which the present barrier mechanism can be used; 

FIGURE 2 shows a simplified block diagram of a processing element 
(PE), including a processor, local memory and associated shell circuitry; 
25 FIGURE 3 shows the format of the barrier synchronization registers 

BSRO and BSR1; 

FIGURE 4 shows a simplified radix-2 barrier synchronization circuit; 
FIGURE 5 shows a simplified radix-4 barrier synchronization circuit; 
FIGURE 6 shows the format of the barrier synchronization function 
30 (BSFR) register; 
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FIGURE 7 shows the format of the barrier synchronization mask and 
interrupt (BSMI) register; 

FIGURE 8 shows the bypass points in a simplified radix-2 barrier 
synchronization circuit; 
5 FIGURE 9 shows the barrier synchronization circuit of FIGURE 8, 

with some of the bypass points redirected to the fanout block; 

FIGURES 10A-E show how a 512 PE system can be partitioned at 
each level of a barrier synchronization circuit; 

FIGURE 1 1 shows how the barrier synchronization mask and intemipt 
10 (BSMI) register is used in concert with the bypass points in the barrier 

synchronization circuits to achieve the desired shape and size of a particular 
partition; 

FIGURE 12 shows the format of the barrier timing register 
BARJTMG register; 
15 FIGURE 13 shows a block diagram of a bypass point; and 

FIGURE 14 shows the hardware state sequencing implementation for a 
single barrier bit. 

DETAILED DE SCRIPTION OF THE PREFERRED EMBODIMENT 
20 In the following detailed description of the preferred embodiment, 

reference is made to the accompanying drawings which form a part hereof, and in 
which is shown by way of illustration a specific embodiment in which the invention 
may be practiced. It is to be understood that the present detailed description is 
intended as exemplary only, and that other embodiments may be utilized and 
25 structural changes made without departing from the spirit and scope of the present 
invention. 

The preferred MPP system, for which the present invention provides 
hardware support for barrier synchronization, is a MIMD massively parallel 
multiprocessor with a physically distributed, globally addressable memory. A 
30 representative MPP system 100 is shown in FIGURE 1. The MPP system 100 
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contains hundreds or thousands of processors, each accompanied by a local memory 
and associated support circuitry. Each processor, local memory and support circuity 
component is called a processing element (PE). The PE's in the MPP system 100 are 
linked via an interconnect network. 
5 The preferred MPP system 100 has a physically distributed memory, 

wherein each processor has a favored, low latency, high bandwidth path to a local 
memory, and a longer latency lower bandwidth access to the memory banks 
associated with other processors over the interconnect network. In the preferred 
embodiment, the interconnect network is comprised of a 3-dimensional torus which 
10 when connected creates a 3-dimensional matrix of PEs. The torus design has several 
advantages, including speed of information transfers and the ability to avoid bad 
communication links. The interconnect network is also scalable in all three 
dimensions. The interconnect network is described in more detail in the copending 
and commonly assigned U.S. Patent Application Serial Number 07/983,979, entitled 
15 "Direction Order Routing in Multiprocessing Systems", to Gregory M. Thorsen, filed 
November 30, 1992, which is incorporated herein by reference. 

FIGURE 2 shows a simplified block diagram of a PE 200. An 
individual PE includes a high-performance RISC (reduced instruction set computer) 
microprocessor 202. In the preferred MPP system, microprocessor 202 is the 
20 DECChip 21064-AA RISC microprocessor, available from Digital Equipment 

Corporation. Each PE is coupled to a local memory 204 that is a distributed portion 
of the globally-addressable system memory, and includes a shell of circuitry that 
implements synchronization and communication functions facilitating interactions 
between processors. 

25 The shell circuitry includes an interconnection network router 206, 

used to connect multiple PEs in the three-dimensional toroidal "fabric". The 
interconnection network carries all data communicated between PEs and memories 
that are not local. An address centrifuge and block transfer engine 208 in the PE 
shell circuitry permits asynchronous (i.e., independent of the local processor) 

30 movement of data between the local memory 204 and remote memories associated 
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6 

with other PEs, such as block transfers, with flexible addressing modes that permit a 
high degree of control over the distribution of data between the distributed portions 
of the system memory. The address centrifuge and block transfer engine are 
described in detail in the copending and commonly assigned U.S. Patent Application 
entitled "RECURSIVE ADDRESS CENTRIFUGE FOR DISTRIBUTED MEMORY 
MASSIVELY PARALLEL PROCESSING SYSTEMS", filed on even date herewith 
to Fromm, which application is incorporated herein by reference. 

The shell circuitry also includes a data prefetch queue 210 which 
allows the processor 202 to directly initiate data movement between remote 
memories and the local processor in a way that can hide the access latency and 
permit multiple remote memory references to be outstanding. 

Synchronization circuits in the shell permit synchronization at various 
different levels of program or data granularity in order to best match the 
synchronization method that is "natural" for a given parallelization technique. At the 
15 finest granularity, data-level synchronization is facilitated by an atomic swap 

mechanism that permits the locking of data on an element-by-element basis. A more 
coarse grain data-level synchronization primitive is provided by a messaging facility, 
which permits a PE to send a packet of data to another PE and cause an interrupt 
upon message arrival, providing for the management of message queues and 
20 low-level messaging protocol in hardware. Control-level synchronization at the 
program loop level is provided by a large set of globally accessible 
fetch-and-increment registers that can be used to dynamically distribute work (in the 
form of iterations of a loop, for instance) among processors at run time. The present 
invention provides yet another form of control-level synchronization, barrier 
25 synchronization, which is useful to control transitions between major program blocks 
(i.e., between loops performing actions on the same data sets). 



30 



Barrier Synchronization 

The present invention provides hardware support for barrier 
synchronization which results in a low-latency method of synchronizing all or a 
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portion of the PEs in an MPP system. The barrier synchronization mechanism of the 
present invention may be used to perform two types of synchronization: barrier and 
eureka. 

A barrier is a point in program code where, after reaching the barrier, 
5 a processor must wait until all other processors participating in the computation have 
also reached the barrier. After all of the processors reach the barrier, the processors 
continue issuing program code. 

A programmer may place a barrier in a program between distributed, 
parallel loops performing operations on the same data. By doing this, the 
10 programmer ensures that all of the processors associated with a loop finish the loop 
(all writes to shared data complete) before any of the processors continue with other 
program instructions (access the new values of the shared data). 

A eureka is a point in program code where a condition is established 
that only a single processor need satisfy, thereby causing all processors to proceed 
15 beyond the eureka point. To participate in a eureka event, all processors initialize 
the eureka barrier mechanism described herein and enable an interrupt, then proceed 
executing program code to solve for the eureka condition. As soon as any processor 
completes the computation, it triggers the eureka, thus causing an interrupt to all PEs. 
The interrupt indicates that the eureka has been satisfied and all PEs may continue 
20 beyond the eureka point. 

Eureka synchronization has several uses, including database searches. 
Using eureka synchronization, a programmer can stop a database search as soon as 
any processor succeeds in finding the desired data rather than waiting for all of the 
processors to exhaust the search. 



25 



Logical Barrier Synchronization Circuits 

The preferred barrier mechanism has 16 logical barrier synchronization 
circuits. Each PE in the MPP system has an input to each of the 16 logical barrier 
synchronization circuits. The multiple barrier synchronization circuits facilitate 
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partitioning between users and the operating system, as well as providing redundant 

resources for fault tolerance. 

The inputs to the 16 logical barrier synchronization circuits are 16 bits 

which are contained in two special registers. Each PE contains two 8-bit registers 
5 called barrier synchronization register 0 (BSRO) 300 and barrier synchronization 

register 1 (BSR1) 301. FIGURE 3 shows the format of BSRO 300 and BSR1 301. 

Each of the 16 bits which comprise BSRO 300 and BSR1 301 is an input to one of 

the 16 logical barrier synchronization circuits. Thus, each PE has an input to each of 

the 16 logical barrier synchronization circuits. 
10 Barrier synchronization register 0 (BSRO) 300 is an 8-bit, readable and 

writable, general access register. Preferably, BSRO 300 contains the eight least 

significant barrier bits for a PE. When read from, the value of BSRO 300 represents 

the value of bits 2' through 20 of BSRO 300. The remaining bits are not valid. 

Table 1 shows the bit format of BSRO 300 and describes each bit of the register. 
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T a ble l 

BSRf) 300 Format 



Bits 


Name 


2° 


Barrier Bit 2° 


2' 


Barrier Bit 2 1 


2 2 


Barrier Bit 2 2 


2 3 


Barrier Bit 2 3 


2 4 


Barrier Bit 2* 


2 5 


Barrier Bit 2 J 


2 6 


Barrier Bit 2 5 


2 7 


Barrier Bit 2 7 




These bits are not used. 



15 Barrier synchronization register I (BSR1) 301 is an 8-bit, readable and 

writable, privileged access register. BSR1 301 contains the eight most significant 
barrier bits for a PE. When read from, the value of BSR1 301 represents the value 
of bits 2 15 through 2 s of BSR1 301 and 2 7 through 2° of BSR0 300. The remaining 
bits are not valid. Table 2 shows the bit format of BSR1 301 and describes each bit 

20 of the register. 
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Tab le 2 
P8R1 301 Ffirmat 

Bits Name 

2 7 -2° These bits are not used; however, when 

BSR1 301 is read, these bits contain the 
value of bits 2 7 through 2° of BSRO 300. 

5 2* Barrier Bit 2 8 

2 9 Barrier Bit 2 9 

2 10 Barrier Bit 2 10 

2" Barrier Bit 2 U 

2 12 Barrier Bit 2 12 

10 2 13 Barrier Bit 2 13 

2 14 Barrier Bit 2 M 

2 15 Barrier Bit 2 15 



15 



2 63 -2 16 These bits are not used. 



All 16 of the logical barrier synchronization circuits function 
identically and operate independently. An example of the operation of the barrier 
synchronization circuits will now be given, using bit 2 2 of BSRO 300 when it is used 
for barrier synchronization and for eureka synchronization as an example for 
20 purposes of illustration. 

It shall be understood that the remaining bits function in the same way 
as bit 2 2 in the following discussion. Each logical barrier synchronization circuit is 
implemented as an AND-tree and fanout-tree circuit. FIGURE 4 shows a barrier 
synchronization circuit 400 in a simplified MPP system using 2-input AND gates to 
25 form the AND-tree. For simplicity of illustration, the MPP system shown in 

FIGURE 4 contains only eight PEs. It shall be understood, however, that the present 
barrier mechanism and barrier synchronization circuits can be used with a system 
having any number of PEs. 
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Because the network in FIGURE 4 is accommodating 8 processors, 
PE0-PE7, and the AND-tree is implemented with 2-input AND gates, log 2 8=3 levels 
of AND-tree are required to arrive and a final barrier bit representing the logical 
product of all the PEs 1 barrier bits. 

As a starting condition, before barrier synchronization begins, bit 2 2 of 
BSRO 300 is reset to 0 in all of the PEs. When a processor satisfies the barrier 
condition, that processor sets bit 2 2 of its associated BSRO 300 register to 1. This 
action sends a 1 to one of the AND gates in the first layer of the AND-tree. 

The first layer of the AND-tree shown in FIGURE 4 contains four 
AND gates. Each AND gate receives signals from two PEs. For example, one AND 
gate may receive signals from bit 2 2 of BSRO 300 in PE 0 and bit 2 2 of BSRO 300 in 
PE 1. When all of the processors reach the barrier point in the program code (satisfy 
the barrier) they have each set bit 2 2 of their associated BSRO 300 to 1, causing the 
output of each of the four AND gates in the first level of the AND-tree to switch to 
1. 

The second level of the AND-tree of FIGURE 4 contains two AND 
gates. Each AND gate receives signals from two of the AND gates in the first level 
of the AND tree. When the output of all of the AND gates in the first level of the 
AND-tree are 1, the output of both the AND gates in the second level of the 
AND-tree are 1. 

The third level of the AND-tree contains the final AND gate. This 
AND gate receives signals from both AND gates in the second level of the 
AND-tree. When the output of both AND gates in the second level of the AND-tree 
are 1, the output of the final AND gate is 1. The output of the final AND gate sends 
an input to the fanout-tree circuit. The fanout tree circuit is used to report the result 
of the AND-tree back to PEs. 

The first fanout block in the fanout tree receives a 1 from the single 
level 3 AND gate. After creating two copies of the 1, the first fanout block sends 
the Fs to the two fanout blocks in the second level of the fanout tree. 
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The two fanout blocks in the second level of the fanout tree each 
create two copies of the 1. The two fanout blocks in the second level of the fanout 
tree then sends the l's to four fanout blocks in the first level of the fanout tree. 

The four fanout blocks in the first level of the fanout tree each create 

5 two copies of the 1. The fanout blocks in the first level of the fanout tree then send 
the l's to the eight PEs. This signals all of the PEs in the system that all of the 
processors have reached the barrier and the processor in the PE may continue with 
other program instructions. Because the fanout tree is dependent on the AND-tree, 
the fanout tree will report that all of the PEs have reached the barrier only when 

10 each PE in the system has set bit 2 2 to 1. 

As will be described more fully below in connection with FIGURE 14, 
the barrier mechanism is designed to be immediately reusable. This means that as 
soon as a processor detects that the barrier bit has cleared (all processors have 
arrived at the barrier), it can immediately set the barrier bit again to announce its 

15 arrival at the next barrier if applicable. Doing so does not affect the notification of 
the prior barrier satisfaction to any other PE. 

Eureka Synchronization 

A logical barrier synchronization circuit 400 may also be used to 

20 perform eureka synchronization. One use of the eureka function is for the 

synchronization of a distributed, parallel data search. Once one of the PEs has 
located the targeted information, the successful PE can set a eureka bit to inform the 
other PEs involved that the search is completed. 

To enable the eureka function, each PE contains a register called the 

25 barrier synchronization function register (BSFR). FIGURE 6 shows the format of the 
BSFR and how it functions in conjunction with the BSRO 300 and BSR1 301 
registers. The barrier synchronization function register (BSFR) is preferably a 16-bit, 
write-only, system privileged register. The BSFR contains a 16-bit mask that 
indicates which bits of BSRO 300 and BSR1 301 are used for barrier synchronization 

30 and which bits are used for eureka synchronization. Bits 0-7 of the BSFR control 
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the function of BSRO 300 and bits 8-15 of the BSFR control BSR1 301 . If a bit of 
the BSFR is set to 1, the corresponding bit of BSRO 300 or BSR1 301 is used for 
eureka synchronization. If a bit of the BSFR is set to 0, the corresponding bit of 
BSRO 300 or BSR1 301 is used for barrier synchronization. Table 3 shows the bit 
5 format of BSFR and describes each bit of the register. 



T able 3 
BSFR Format 

Bits Name 

10 2 7 -2° These bits indicate which bits of BSRO 300 are used for 

eureka synchronization. For example, as shown in Figure 
6, when bit 2 2 of the BSFR is set to 1, bit 2 2 of BSRO 
300 is used for eureka synchronization. When bit 2 2 of 
the BSFR is set to 0, bit 2 2 of BSRO 300 is used for 
barrier synchronization. 

2 ,5 -2 8 These bits indicate which bits of BSR1 301 are used for 

eureka synchronization. For example, as shown in 
FIGURE 6, when bit 2 13 of the BSFR is set to 1, bit 2 13 
of BSR1 301 is used for eureka synchronization. When 
bit 2" of the BSFR is set to 0, bit 2 13 of BSR1 301 is 
used for barrier synchronization. 

2 w -2 16 These bits are not used. 



15 Because each of the 16 logical barrier synchronization circuits operate 

completely independently of each other, any of the bits in each of the two barrier 
synchronization registers BSRO 300 and BSR1 301 for a particular PE can be 
programmed to function in the conventional barrier mode, or in eureka mode. As 
will be described below in connection with FIGURE 14, in eureka mode the output 

20 of the AND-tree is read directly by the PEs with no intervening synchronization 
logic such as the latch used in conventional barrier synchronization. 

Because in eureka mode the output of the fanout tree is read directly 
by the PEs y the barrier hardware is not synchronous as it is in conventional barrier 
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mode, nor is a eureka bit immediately reusable. Using a barrier bit in eureka mode 
requires the use of a second bit programmed to function in barrier mode in order to 
prevent race conditions. The conventional barrier is used to synchronize the entry 
and exit of the PEs into and out of a eureka barrier. This implementation means that 

5 only half of the 16 barrier bits can be used for eureka synchronization. 

Reading BSRO 300 or BSR1 301 returns the current state of the 
AND-tree. This allows the AND-tree to be used as an OR tree by using negative 
logic. After all PEs initialize their barrier bit to a logical 1, the output of the 
AND-tree can be read as a one by all PEs. If any PE writes a zero to its barrier bit, 

10 the output of the AND-tree will read as a zero, performing the OR function. 

A typical eureka sequence begins with all processors initializing the 
eureka bit to a logical 1 and setting the conventional barrier bit to a logical 1 . When 
the barrier switches to a zero (meaning all processors have initialized their eureka bit 
and joined the barrier), the eureka, bit is "armed". The processors then begin the 

15 parallel data search or other activity that is to be terminated by the eureka. When a 
processor satisfies the termination conditions, it clears its eureka bit to alert the other 
processors and sets its conventional barrier bit to indicate it has observed the eureka 
event. As each of the other PEs detect that the eureka event has occurred (because 
the output of the eureka AND-tree drops to a logical 0), it sets its conventional 

20 barrier bit and waits for the barrier to complete. When all PEs have joined the final 
barrier, they may proceed to arm the next eureka. 

Servicing A Barrier 

In the preferred embodiment of the present invention, the processor 
25 monitors the output of the fanout circuitry using one of two mechanisms: 

periodically testing the barrier bit (such as with a continuous loop) or enabling a 
barrier interrupt. 

Continuing to use bit 2 2 as an example, in the first mechanism, after 
the processor sets bit 2 2 of BSRO 300 to 1, the processor may enter a loop that 
30 continuously checks the value of bit 2 2 of BSRO 300. After receiving a 1 from the 
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fanout circuitry, the support circuitry in the PE resets bit 2 2 of BSRO 300 to 0. 
Because the processor is regularly checking the value of bit 2 2 of BSRO 300, the 
processor may continue executing program instructions as soon as it is detected that 
bit 2 2 of BSRO 300 is reset to 0. 

5 In the barrier interrupt mechanism, after the processor satisfies the 

barrier and sets bit 2 2 of BSRO 300 to 1, the processor enables a barrier interrupt. 
The processor may then issue program instructions that are not associated with the 
barrier. After receiving a 1 from the fanout circuitry, the support circuitry in the PE 
resets bit 2 of BSRO 300 to 0 and sets the barrier interrupt to the processor. The 

10 barrier interrupt indicates to the processor that all of the processors have reached the 
. barrier, and causes the processor to suspend the unrelated activity and return to 
executing instructions associated with the barrier. The advantage of the barrier 
interrupt over continuous polling is the ability to perform other useful work while the 
other processors are approaching the barrier. 

IS In the preferred embodiment, the microprocessor enables the barrier 

hardware interrupt using a hardware interrupt enable register (HIER) in the 
microprocessor system control register. For more information on the HIER and the 
system control register, refer to the DECChip 21064- AA RISC Microprocessor 
Preliminary Data Sheet, available from Digital Equipment Corporation, which is 

20 incorporated herein by reference. The DEC microprocessor has 6 inputs for external 
hardware interrupts. These inputs appear in the HIRR (Hardware Interrupt Request 
Register) in bit positions 5-7 and 10-12. One of these six inputs is designated as the 
Barrier Interrupt Request bit. 

The HIRR inputs can be enabled or disabled by a mask bit located in 

25 the HIER (Hardware Interrupt Enable Register) internal to the microprocessor. For 
more information on the HIRR and HIER registers, refer to pages 3-26 through 3-29 
in the Digital Equipment Corporation publication: EV-3 AND EV-4 
SPECIFICATION Version 2.0 May 3, 1991, which is incorporated herein by 
reference. 
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Those skilled in the art are aware that all RISC microprocessors 
provide comparable facilities for allowing and controlling the direct sampling of 
external hardware interrupt signals, and that the present invention is not limited to 
use with the DEC RISC microprocessor described herein. 

5 The interrupt input to the microprocessor is asserted whenever any of 

the barrier bits selected by the BSMI (barrier synchronization mask and interrupt, 
discussed below) make the transition from a logical 1 to a logical 0. The interrupt 
input to the microprocessor is cleared by writing a bit in the system control register. 
To assure that no satisfied barrier events are missed, the correct programming 

10 sequence would be to clear the interrupt, then read BSRO 300 and BSR1 301 to 
determine which bit(s) have toggled. 

After the support circuity sets the barrier hardware interrupt, the 
processor must read the BSMI register (as described below) to determine if the 
interrupt was associated with BSRO 300 or BSR1 301. When the processor reads the 

15 value of the BSMI register, the support circuitry clears the interrupt associated with 
BSRO 300 and the interrupt associated with BSR1 301. 

If a barrier interrupt occurs while the processor is reading the BSMI 
register, the interrupt still occurs and is not cleared. The processor must then read 
the value of the BSMI register again to determine if the interrupt was associated with 

20 BSRO 300 or BSR1 301 and to clear the interrupts. 



Logical Partitions 

Not all of the PEs in a multiprocessing system may need to be part of 
a barrier or eureka synchronization process. Also, it is often desirable to have 

25 several partitions of PEs operational on different tasks simultaneously. To facilitate 
this, each PE also contains a barrier synchronization mask and interrupt (BSMI) 
register which is used to enable or disable a logical barrier synchronization circuit for 
a PE. The BSMI register contains a mask that indicates which bits of BSRO 300 and 
BSR1 301 are enabled for the PE, thus defining which partition(s) a PE is a member 

30 of, and defining partitioning among the PEs. 



WO 95/16236 PCT/US94/14067 

17 

The 16 independent barrier synchronization circuits visible to the user 
in the barrier registers can be assigned by the operating system to different groups or 
teams of processors. The BSMI allows the barrier bits in any given processor to be 
enabled or disabled. Alone, the BSMI permits the division of the barrier resource 

5 among up to 16 different partitions with arbitrary membership. Used in conjunction 
with the barrier network partitioning capability described herein below, the BSMI 
allows the flexible subdivision of the barrier resource among a very large number of 
partitions in a scalable fashion. 

FIGURE 7 shows the format of the BSMI register. The BSMI is a 

10 16-bit, readable and writable, system privileged register. When written to, the BSMI 
register controls which bits in BSRO 300 and BSR1 301 are enabled for use by the 
PE. Bits 7-0 of BSMI enable the bits in BSRO 300, while bits 15-8 enable the bits 
in BSR1 301. If a bit of the BSMI register is set to 1, the corresponding bit of 
BSRO 300 or BSR1 301 is enabled. If a bit of the BSMI register is set to 0, the 

15 corresponding bit of BSRO 300 or BSR1 301 is disabled. 

The BSMI register has a different bit format when written to than it 
does when it is read from. Table 4 shows the bit format of the BSMI register when 
it is written to and describes each bit of the register. 

A disabled BSRO 300 or BSR1 301 bit appears to the logical product 

20 network to be always satisfied (a logical " F). This permits the barrier to function 
normally for other processors whose barrier bit is enabled. Reading a disabled 
barrier synchronization register bit returns a logical H 0 M . Writing a disabled barrier 
synchronization register bit has no effect. 
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IflhkJ 
BMSI Register Write Format 



Bits Name 

2'.2° These bits indicate which bits of 

BSRO 300 are enabled for use by the 
PE. For example, when bit 2 J of the 
BSMI register is set to 1, as shown in 
FIGURE 7, bit 2 J of BSRO 300 is 
enabled for use by the PE. When bit 
2 J of BSRO 300 is disabled and 
cannot be used by the PE. 

2 1 $.2' These bits indicate which bits of 

BSR1 301 are enabled for use by the 
PE. For example, when bit 2 ,} of the 
BSMI is set to 1, as shown in 
FIGURE 7, bit 2" of BSR1 301 is 
enabled for use by the PE. When bit 
2" of the BSMI is set to 0, bit 2 13 of 
BSR1 301 is disabled and cannot be 
used by the PE. 

2«.2 16 These bits are not used. 



Table 5 shows the bit format of the BSMI register when it is read from and 
describes each bit of the register. When read from, bits 2 14 and 2" of the BSMI 
register provide the current state of the barrier interrupts from BSRO 300 and BSR1 
301, respectively. After being read the BSMI register is cleared. 
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Table 5 
BSMI Register Read Format 

Bits Name 

2 ,3 -2° These bits are not valid. 

2 14 This bit reflects the cunrent state of 

the barrier interrupt associated with 
bits 2' through 20 of BSRO 300. 

2 15 This bit reflects the current state of 

the 

barrier interrupt associated with bits 2 
through 28 of BSR1 301. 

2 63 -2 16 These bits are not valid. 



10 Software may use the BSMI register to allow any set number of PEs 

to use one of the logical barrier synchronization circuits, and thus set up a partition 
among the PEs. For example, software may set bit 2 J of the BSMI register to 1 in 
only four of the PEs in an MPP system. In this case, only the four PEs with bit 2 2 
of the BSMI register set to 1 may use the logical barrier synchronization circuit 

15 associated with bit 2 2 of BSRO 300. This creates a logical barrier partition among 
the four PEs. 

The BSMI and BSFR registers can be used in concert to arrive at 
several different barrier synchronization mechanisms. Table 6 shows the effect of 
one bit from the BSMI register and the BSFR on the corresponding bit in BSRO 300 
20 orBSR1301. 
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Physical Barrier Synchronization Circuits 

Although each of the 16 bits in the BSRO 300 and BSR1 301 registers 
in each PE represent an input to one of 16 logical barrier synchronization circuits, 
the preferred embodiment of the present barrier synchronization mechanism does not 
5 contain 16 physical barrier synchronization circuits. Instead, in the preferred 

implementation, the system contains 4 physical barrier synchronization circuits. The 
16 bits of BSRO 300 and BSR1 301 are time multiplexed into the four physical 
barrier synchronization circuits. 

One exemplary physical barrier synchronization circuit is shown in 
10 FIGURE 5. The preferred barrier network is implemented using a radix-4 AND-tree 
. and fanout tree. As shown, a 1024 PE system contains log4 1024 = 5 levels in a 
barrier synchronization circuit. 

Table 7 shows the input to each of the four physical barrier 
synchronization circuits during each clock period (CP). Four CPs are required for 
15 the physical barrier synchronization circuits to receive the input from all 16 bits in 
BSRO 300 and BSR1 301. 

The input registers to the logical barrier synchronization circuits (as 
shown and described below in connection with FIGURE 14) are completely parallel, 
so any number of PEs can set barrier bits in the same clock period without 
20 contention. All PEs are informed that a particular barrier has been reached 

simultaneously, although due to the time multiplexed nature of the inputs to the 
circuits different processors may be at different points in a spin-loop testing the 
barrier bit. 
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Tab le 7 



Physical Bairier Synchronization 


Circuit Inputs 




First C 


Second C 


ThirH P 


Fourth C 


Bit 2° 


Bit 2 4 


Bit 2 ! 


Bit 2' 2 


ofBSRO 


ofBSRO 


of BSR1 


of BSR1 


Bit2' 


Bit 2 s 


Bit 2' 


Bit 2 13 


of BSRO 


of BSRO 


of BSR1 


of BSR1 


Bit2 2 


Bit2 6 


Bit 2'° 


Bit 2 14 


of BSRO 


ofBSRO 


of BSR1 


ofBSRl 


Bit2 3 


Bit2 7 


Bit 2" 


Bit 2' 5 


of BSRO 


ofBSRO 


of BSR1 


ofBSRl 



Circuit 
0 

l 

2 
3 



10 With the preferred radix-4 AND-tree implementation such as that 

shown in FIGURE 5, each level of the tree takes approximately one clock period in 
logic and another 1 to 1.5 clock periods in wire travel time to accomplish. This 
assumption allows the latency for a barrier between 1024 processors to be estimated 
and compared with known latencies for software barrier techniques. 

15 Table 8 illustrates the number of barrier bits and the estimated number 

of clock periods at each level of a radix-4 tree such as the one shown in FIGURE 5 
connecting 1024 PEs. The radix-four tree reduces the barrier bit count by a factor of 
four in each level. 
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Barrier bit logical product tree levels 

Number of 

barmr Mfe Radix-four topical product tree level 

5 1024 Level one, clock period 0 

256 Level two, clock period 2 

64 Level three, clock period 4 

16 Level four, clock period 6 

4 Level five, clock period 8.5 

(1.5 clock wire) 

.10 1 Level six, clock period 1 1 

(1.5 clock wire) 

From Table 8 it can be seen that eleven clock periods are consumed to 
perform the logical product of the barrier bits from 1024 PEs using a radix-four tree 
with two to two-and-a-half clock periods per level of delay. If it is further assumed 
15 that the necessary fan-out of the final logical product all 1024 PEs was performed 
using a series of one-to-four single-clock fan-outs, another eleven clock periods 
would elapse, bringing the total latency time of a barrier propagation to 22 clock 
periods. 

From this, the projected performance impact of the radix-4 tree 
20 implementation is relatively straight-forward to predict With a time multiplexed tree 
that is four bits wide, 4 cycles of the tree are required to update the values of all 16 
barrier bits. This is in addition to the approximately 22 clocks of initial latency. 
The best case occurs when the last processor sets a barrier bit at the same time the 
4-bit "slice" that bit is a member of is entered into the logical product network (refer 
25 again to Table 7), adding nothing to the tree latency of 22 clocks. The worst case 
occurs when the last processor just misses the best case, setting a bit one clock 
period after the slice that bit is a member of was entered into the network. When 
this happens, the slice must wait 4 clocks to be entered into the network and another 
22 for the propagation delay of the network. The final result is a minimum time of 
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22 cycles, maximum of 26, and an average of around 24 cycles. The resulting delay 
is about 160 nanoseconds at a 6.6 nanosecond clock period, assuming that the barrier 
network is clocked at the same speed as the processor. This delay is several orders 
of magnitude faster than known software barrier implementations, which can 
5 approach 500 microseconds or higher. This significant advantage of the present 
barrier synchronization mechanism will be well appreciated by those of skill in the 
art. 



Physical Partitions 

10 Note in FIGURES 4 and 5 that the "shape" of the fanout tree closely 

matches the configuration of the AND-tree. In the preferred embodiment, the fan-in 
nodes performing the AND function are physically located on the same integrated 
circuits as the fanout blocks at each level of the tree. An AND gate and fanout 
block pair is called a bypass point. This advantage of location makes possible 

15 programmable subdivision, or physical partitioning, of each barrier synchronization 
network into independent subnetworks by rerouting the output of the AND gate in 
each bypass point. 

FIGURE 8 shows bypass points 302, 304, 306, 308, 310, 312 and 314 
in a simplified radix-2 barrier synchronization circuit. In the preferred embodiment, 

20 the output of the AND gate in each of the bypass points can be redirected so that the 
output of the AND gate connects to the fanout block in that bypass point. For 
example, FIGURE 9 shows the same barrier synchronization circuit as shown in 
FIGURE 8. However, the output of the AND gate in bypass points 306, 308 and 
310 is redirected to the fanout block in those bypass point. This results in a 

25 partitioning of the physical barrier synchronization circuit into three smaller barrier 
synchronization circuits, designated by phantom lines 402, 404 and 406. 

The first circuit designated by phantom line 402 contains a two-level 
AND-tree and two-level fanout tree. This circuit is created by redirecting the output 
of the AND gate in bypass point 316 to the fanout block in bypass point 0. This 

30 smaller circuit operates identically to the barrier synchronization circuit in FIGURE 
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8. However, this circuit receives an input from and sends an output to PEs 0 
through 3 only. 

The second and third circuits designated by phantom lines 404 and 
406, respectively, each contain a one-level AND-tree and fanout tree. These circuits 
5 are created by redirecting the output of the AND gates in bypass points 306 and 308 
to the fanout blocks in bypass points 306 and 308. 

The bypass mechanism just described can be used to create several 
types of banier partitions. FIGURE 10 shows the resulting partitions when the PEs, 
in a 1024-PE MPP system using a radix-4 AND-tree and fanout tree are partitioned 
10 at each level of a physical barrier synchronization circuit. Each bypass point in level 
1, of a physical barrier synchronization circuit connects to four PEs for a total of 256 
level 1 bypass points: If all 256 of these bypass points have the output of the AND 
gate redirected to the fanout block, the PEs in the system are divided into 256 4-PE 
partitions, as shown in FIGURE 10A. 
15 If all 32 of the level 2 bypass points have the output of the AND gate 

redirected to the fanout block, the PEs in the system are divided into 64 16-PE 
partitions, as shown in FIGURE 10B. 

If all 8 of the level 3 bypass points have the output of the AND gate 
redirected to the fanout block, the PEs in the system are divided into eight 128-PE 
20 partitions, as shown in FIGURE IOC. 

If all 4 of the level 4 bypass points have the output of the AND gate 
redirected to the fanout block, the PEs in the system are divided into four 256-PE 
partitions, as shown in FIGURE 10D. 

If both of the level 5 bypass points have the output of the AND gate 
25 redirected to the fanout block, the PEs in the system are partitioned as one 1024 PE 
partition, as shown in FIGURE 10E. 

The partitions shown in FIGURE 10 are only a subset of those that 
can be achieved using the bypass mechanism of the present barrier invention. 
FIGURE 10 shown only those achieved by redirecting all of the bypass points at 
30 each level. However, any number of bypass points in any number of different levels 
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can be redirected to achieve a wide variety of bypass partitions. Thus, many 
different partitions of different sizes can be created simultaneously, in the manner 
such as that shown in FIGURE 9. The result is a very flexible and scalable barrier 
resource. It shall therefore be appreciated by those of skill in the art that different 
5 partitions within the barrier mechanism can have bypass points enabled at different 
levels, and that sub-partitions of the partitions can have their barrier circuits bypassed 
at different levels, etc., conferring a great degree of flexibility of final partition 
subdivision upon the barrier tree. 

When a bypass point of a barrier synchronization circuit has the output 
10 of the AND gate redirected to the fanout block, the PEs in the barrier partition can 
still use all 16 bits in BSRO 300 and BSR1 301; however, because there are only 
four physical barrier synchronization circuits, creating a barrier partition affects the 
four of the 16 bits in BSRO 300 and BSR1 301 which are input to the barrier 
synchronization circuit. For example, if a level 4 bypass point in physical barrier 
15 synchronization circuit 2 has the output of the AND gate redirected to the fanout 
block, the barrier partition contains 256-PEs. Because of this barrier partition, bits 
2 1 , 2 6 , 2'°, and 2 M of BSRO 300 and BSR1 301 (the bits that are time multiplexed 
into the same physical barrier synchronization circuit as described above with respect 
to Table 7) in each of the 256-PEs that are part of the barrier partition only affect the 
20 256-PEs in the barrier partition. In the 256-PEs, these bits cannot be used for 
system-level barrier or eureka synchronization, but can be used only for 
partition-level barrier or eureka synchronization for the PEs in the partition. This is 
because the other 768 PEs in the 1024-PE system are not part of the partition created 
and thus barrier communication amongst all PEs is not available in that instance. 
25 Note that, unlike barrier partitions created using the BSMI register 

described above, each of the PEs in a partition created when the barrier network is 
bypassed still is able to use all 16 inputs of the BSRO 300 and BSR1 301 registers, 
just as in the case of an undivided network. Each PE has all 16 barrier bits available 
to it because none are being masked to create the partition. This gives the 
30 characteristic of scalability to the barrier resource: each sub-network created through 
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bypassing the barrier circuits is an independently functioning copy of a barrier bit. 
Thus, the same bit in the barrier register can serve many separate partitions. 

A further advantage of partitions created when the network is bypassed 
in a manner such as that shown in FIGURES 9 and 10 is that the latency of a barrier 
5 is reduced as levels of the fan-in/fan-out tree are avoided. The smaller the partition, 
the lower the latency. 



Flexible Barrier Resource 

As is well known by those skilled in the art, various processing tasks 

10 require partitions of particular sizes and shapes to achieve the most efficient 

performance. By redirecting the output of AND gates in different level bypass 
points, a physical barrier synchronization circuit may be divided into any 
combination of the previously described barrier partitions. However, the number and 
shape of the PEs in the partitions achieved with physical barrier partitioning alone 

15 may not exactly match the size and shape of partition desired. 

The barrier network subdivisions are arranged so that they fall on 
power-of two boundaries coincident with the logical partitions created when the 
preferred 3-D torus interconnect network is subdivided. The 3-D torus interconnect 
network can be divided by two in any dimension, raising the probability of 

20 mismatches between the desired partitioning of the interconnect network and the 
partitioning of the barrier network achievable by means of the network bypass 
mechanism. In these cases, the BSMI register is used in concert with the physical 
barrier partitioning mechanism to accomplish the partitioning along the boundaries 
desired. The operating system can use the BSMI register to disable some of the bits 

25 in BSRO 300 and BSR1 301 for some of the PEs and enable these bits for the other 
PEs in the barrier partition to arrive at almost any desired group of PE partitions. 

For example, it may be desirable to logically divide the 512-PE system 
and the associated 8*8*8 3-D torus network shown in the upper left of FIGURE 1 1 
into eight equal-sized 64-PE partitions with dimensions 4*4*4 shown in the lower 

30 left of FIGURE 11. Because of the radix-4 implementation of the barrier 
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synchronization, the barrier synchronization circuits can only be divided by 4 using 
the bypass mechanism and the division is fixed in the dimensions that are affected. 
For example, the level 5 barrier synchronization circuit subdivision using the physical 
barrier partitioning will split the PEs into four 128-PE partitions each dimensioned 

5 4x4x8 as shown at the top center of FIGURE 1 1 . This is an insufficient level of 
subdivision to match the desired interconnect network partitioning already described. 
However, if the next level of physical barrier network partitioning is activated, it will 
split each partition by 4 again, into sixteen 32-PE partitions each dimensioned 4x2x4 
as shown at the far right of FIGURE 1 1 . This results in too much partitioning to 

10 match the desired group of eight 64-PE partitions. 

To achieve the desired degree of partitioning, the barrier 
synchronization circuits are first physically partitioned using the level 5 bypass 
mechanism to achieve four 4x4x8 partitions. The final division of the by-8 
dimension is accomplished using the appropriate bits of the BSMI register. To 

1 5 achieve this result, the appropriate bit of the BSMI register of half of the PEs are set 
to 0, while the remaining half have their corresponding BSMI bit set to 1 . This 
results in dividing the partitions in half to arrive at the desired partitioning of that 
shown in the bottom center of FIGURE 11, which is a match for the desired 
partitioning shown in the lower left of FIGURE 1 1. 

20 

Bypass Circuit Implementation 

FIGURE 12 shows a block diagram of a hardware implementation for 
a radix-4 bypass circuit. The four inputs from the previous level of bypass circuits, 
or from the BSRO 300 or BSR1 301 registers if the bypass circuit is a level 1 bypass 
25 circuit, are input to four-input AND gate 350. The AND function is performed and 
the result output to the A input to mux 354 and to the B input to mux 356, which 
together form a bypass switch 355. 

The "B n input to mux 354 is hardwired to a logical T. The A input 
to mux 356 is received from the next level of fan-out in the fan-out tree. 
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For normal barrier synchronization circuit operation, select signal "S" 
is set such that the A input to both mux 354 and mux 356 are selected. If the output 
of AND 350 is to be redirected to fan-out circuit 352, select signal "S" is set such 
that the B inputs to mux 357 and mux 356 are selected. Thus, the result of the AND 
5 350 is selected by mux 356. In this manner, the physical barrier partitioning 
achieved by "short-circuiting" the bypass circuits is accomplished. 

However, if the output of the AND is to be redirected to the fanout 
block, the select line of mux 356 will select such that the AND output transmitted to 
the fanout circuit 352 by mux 356. 

10 

. Timing 

Because a physical barrier synchronization circuit may be divided into 
partitions, and because the 16 barrier bits are actually implemented by 
time-multiplexing on 4 physical barrier circuits, the total time for a bit to propagate 

15 through the circuit is not constant. This timing inconsistency may cause a bit to be , 
in the wrong position when BSRO 300 or BSR1 301 is read. For example, if the 
timing is not set up correctly for physical barrier synchronization circuit 0, a bit that 
was originally written to the 20 bit location in BSRO 300 may appear in the 2 4 bit 
location of BSRO 300 when BSRO 300 is read (refer again to Table 7). 

20 Because the barrier synchronization circuits are time multiplexed, and 

each bit in a barrier register is updated only every four clock periods, it is necessary 
for the depth of the pipelined fan-in/fan-out tree to be a multiple of four clock 
periods deep regardless of how it is configured or bypassed. When a physical barrier 
synchronization circuit is divided into barrier partitions, the time needed for a bit to 

25 propagate through each physical barrier synchronization circuit is not consistent. 

Since bypassing a level can remove a non-multiple-of-four number of clock periods 
of circuitry, it is necessary to have some programmable delays to adjust the apparent 
depth of the barrier network pipeline to a multiple of four after a bypass is 
implemented. These programmable delays, capable of adding from 0 to 3 clocks of 

30 delay, are located at each PE and skew the signal arriving from the barrier network 
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for each of the 4 physical circuits. The delays are programmed by writing an 8-bit 
memory-mapped register called the Barrier Timing register (BARJTMG). FIGURE 
13 shows the bit format of the BARJTMG register. This register is organized as 
four groups of two bits, each of which represents a delay from 0 to 3 clocks to be 
5 added to one of the four physical barrier synchronization circuit outputs to correct 

the pipeline depth. 

The BARJTMG register is an 8-bit, write-only, system privileged 
register. The BARJTMG register controls the timing of each physical barrier 
synchronization circuit. Table 9 shows the bit format of the BARJTMG register. 



10 



Table 9 
BAR TMG Bit Format 



15 



Bits 

2'-2° 
barrier 



Name 

These bits control the timing for 
physical barrier synchronization 
circuit 0. 



2 3 -2 2 
barrier 



These bits control the timing for 
physical barrier synchronization 
circuit 1. 



2 5 -2 4 
barrier 



These bits control the timing for 
physical barrier synchronization 
circuit 2. 



20 



2 7 -2 6 
barrier 



These bits control the timing for 
physical barrier synchronization 
circuit 3. 



2 63 -2 B 



These bits are not used. 



As an example, the following procedure sets the BARJTMG timing value for 
physical barrier synchronization circuit 0 in the PEs of a barrier partition. Before 
performing the procedure, Barrier hardware interrupt to the processors should be 
disabled. 
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1 . Write a value of 1 1 1 1 16 to the BSMI register in all of the PEs to 
enable barrier bits 2 12 , 2 8 , 2\ and 2°. 

2. Write a value of 11 1 1 16 to the BSFR in all of the PEs to set barrier 
5 bits 2 12 , 2 8 , 2\ and 2° to eureka mode. 

3. Write a value of 1 1 1 1 16 to BSRO 300 and BSR1 301 in all of the PEs 
to start the eureka processes. 

10 4. In one of the PEs write a value of 11 10 16 to BSRO 300 to indicate that 

bit 20 has completed a eureka. 

5. In each of the PEs, read the value of BSR1 301 which contains the 
value of all 16 barrier bits and apply a software mask so that the only bits 
15 read are 2 12 , 2 B , 2\ and 2° and the remaining bits are set to 0. The value read 

from BSR1 301 may be 1110 I6f 1101 l6 , 1011 l6 , or 0111 16 . If the value is not 
1110, 6> increment bits 2 1 through 2° of the BARJTMG register by 1 and read 
the value of BSR1 301 again. 

20 6. If the value read from BSR1 301 is now 1 1 10, 6 , the timing is set up 

correctly. If the value is not 1 1 10 l6 , increment bits 2* through 2° of the 
BARJTMG register by 1 and read the value of BSR1 301 again. This 
process must be repeated until the value read from BSR1 301 is 1110, 6 , but 
should not need to be performed more than three times. This is because if it 

25 is correct, the value need not be incremented at all, and if it is not correct, 

incrementing by three will cause the BSR1 301 to run through all four 
possible values. 
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This procedure may be used to set the timing for any of the physical 
hairier synchronization circuits in a barrier partition. Table 10 lists the barrier bits 
affected and the write pattern for each physical barrier synchronization circuit. 



5 Table 10 

Timing Procedure Values 

0 2» 2\ 2\ and 2° 1110 16 

1 2 ] \ 2\ 2\ and 2 1 2220 16 

2 2 14 , 2 ,0 S 2 6 , and 2 2 4440 l6 
10 3 2", 2 n , 2 7 , and 2 3 8880 >6 



Barrier Synchronization Register Implementation 

A single bit of the barrier registers BSRO 300 or BSR1 301, BMSI 
and BSFR associated with a PE is illustrated in FIGURE 14. As stated above, all 16 

15 bits in the preferred barrier register function identically. For simplicity of 

illustration, only the circuitry associated with one bit is shown. Also, note that the 
barrier network is not shown, nor is the circuitry used to interface the barrier 
registers to the microprocessor. 

In FIGURE 14, multiplexors are market "MUX" and have a select 

20 input M S" and a pair of data inputs "1" and "0" that are gated to the output depending 
on the state of the M S" input. Latches have a data input "D" and a clocked output 
"Q" and may have an enable input M E M that controls when new data is to be captured. 
If a latch does not have an "E" input, then new data is captured every clock. 

The three principle barrier registers are identifiable as the mask latch 

25 802 (for BSMI), the fiinction latch 804, and the barrier register itself, latch 808 (for 
BSRO 300 or BSR1 301) . Data from the microprocessor can be entered into any of 
these latches via the processor writing to special memory-mapped addresses. Control 
hardware decodes the address to generate one of the control signals LD_MSK, 
LD_FCN, or LD_BAR. LD_MSK enables data to be entered the mask latch 802, 

30 LD_FCN enables data to be entered the function latch 804, and LD_BAR controls 
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the input multiplexor 806 that gates data into barrier register latch 808 every clock 
period. Data in the mask 802 and function 804 latches is held in the absence of the 
enable signals, while the barrier latch hold path is through multiplexors 810 and 806. 

If mask latch 802 has been set to a 0 (disabling the barrier bit), data is 
5 inhibited from entering the barrier register latch 808 by forcing a constant 1 via OR 
gate 805. This forces a constant 1 into the barrier network as well, allowing other 
selected bits at other PEs to function normally. A 0 in the mask latch 802 also 
forces a 1 into the receive latch 812 through OR gate 811. This disables the local 
processor from "eavesdropping" on the barrier activities of other PEs while 

10 deselected and prevents spurious barrier interrupts from deselected barrier bits. 

Function latch 804 controls multiplexors 814 and 810. If the function 
latch 804 is set to 0 (barrier mode), mux 814 delivers the current state of the barrier 
register while mux 810 causes the barrier register latch to remain set if it is currently 
set as long as the incoming barrier state is still a zero (i.e., the barrier has not been 

IS satisfied). When the incoming barrier state switches to a 1, indicating that all PEs 
have set their barrier bits, then the barrier register hold path is broken by AND gate 
816 and barrier register 808 reverts to a 0 until it is set again by the processor. 
Thus, the barrier mode is a synchronous and immediately reusable mode. 

If function latch 804 is set to a 1 (eureka mode), mux 814 

20 continuously gates the incoming state of the barrier tree to the processor through 811 
and 812 and, while mux 810 causes the barrier register latch to remain at whatever 
state, 0 or 1, that the processor sets it to. Thus, the eureka mode is asynchronous: 
any changes to the barrier registers flow through the barrier tree and are directly 
sampled by the processors, allowing the AND-tree of the barrier synchronization 

25 circuit to be used as an OR-tree for eureka barriers. 

Although specific embodiments have been illustrated and described 
herein for purposes description of the preferred embodiment, it will be appreciated by 
those of ordinary skill in the art that a wide variety of alternate and/or equivalent 
implementations calculated to achieve the same purposes may be substituted for the 

30 specific embodiment shown and described without departing from the scope of the 
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present invention. Those of skill in the electrical and computer arts will readily 
appreciate that the present invention may be implemented in a very wide variety of 
embodiments. This application is intended to cover any adaptations or variations of 
the preferred embodiment discussed herein. Therefore, it is manifestly intended that 
this invention be limited only by the claims and the equivalents thereof. 
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What is claimed is; 

1. A computer system (100) comprising: 

a plurality of processing elements (PEs) (200) including a first processing 
5 element (PE) (200), wherein said first PE (200) comprises: 

a barrier synchronization register one (BSR1) (301), said BSR1 
(301) including a first BSR bit (340) and a second BSR bit (342), 
wherein said first BSR bit (340) and said second BSR bit (342) are set 
when said first PE (200) reaches a first barrier point and a second 
10 barrier point, respectively; 

a first BSR output (401); and 

a coupler which couples a value representative of said first 
BSR bit (340) to said first BSR output (401); and 
a barrier synchronization circuit (BSC) (400) having a first physical barrier 
15 synchronization circuit (PBSC) (402), wherein said PBSC (402) comprises a plurality 
of bypass points including a first bypass point (302) and a second bypass point (310), 
wherein each of said bypass points comprise: 

a fanin gate (330) having a plurality of fanin inputs and a fanin 
output, wherein said fanin output generates an indication of whether 
20 said fanin inputs are all set; 

a fanout circuit (331) having a fanout input and a fanout 
output; and 

a bypass switch (355) for switchably coupling said fanin output 
to said fanout input; 

25 wherein one of the fanin inputs of said first bypass point (302) is coupled to 

said first BSR output (401), wherein one of the fanin inputs of said second bypass 
point (310) is coupled to the fanin output of said first bypass point (302), wherein 
the fanout input of said second bypass point (310) is coupled to the fanin output of 
said second bypass point (310), wherein the fanout input of said first bypass point 
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(302) is coupled to the fanout output of said second bypass point (310), and wherein 
the fanout output of said first bypass point (302) is coupled to said first PE (200). 

2. The computer system according to claim 2, wherein 

5 said coupler comprises a time multiplexor (TM), wherein said TM time- 

multiplexes said first BSR bit (340) and said second BSR bit (342) on said first BSR 
output (401). 

3. The computer system according to claim 2, wherein said BSC 400 further 
10 comprises a second PBSC (402), wherein said second PBSC (402) comprises: 

a third bypass point (302) and a fourth bypass point (310), wherein each of 

said bypass points comprise: 

a fanin gate (330) having a plurality of fanin inputs and a fanin 
output which generates an indication of whether said fanin inputs are 
15 all set; 

a fanout circuit (331) having a fanout input and a fanout 
output; and 

a bypass switch (355) for switchably coupling said fanin output 
to said fanout input; 

20 wherein said first PE (200) further comprises a second BSR output (401); 

wherein said coupler further couples a value representative of said second 
BSR bit (342) to said second BSR output (401); and 

wherein one of the fanin inputs of said third bypass point (302) is coupled to 
said second BSR output (401), wherein one of the fanin inputs of said fourth bypass 

25 point (310) is coupled to the fanin output of said third bypass point (302), wherein 
the fanout input of said fourth bypass point (310) is coupled to the fanin output of 
said fourth bypass point (310), wherein the fanout input of said third bypass point 
(302) is coupled to the fanout output of said fourth bypass point (310), and wherein 
the fanout output of said third bypass point (302) is coupled to said first PE (200). 
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4. The computer system according to claim 2, wherein said BSC 400 further 
comprises a second PBSC (402), wherein said second PBSC (402) comprises: 

a third bypass point (302) and a fourth bypass point (310), wherein each of 
said bypass points comprise: 
5 a fanin gate (330) having a plurality of fanin inputs and a fanin 

output which generates an indication of whether said fanin inputs are 
all set; 

a fanout circuit (331) having a fanout input and a fanout 
output; and 

10 a bypass switch (355) for switchably coupling said fanin output 

to said fanout input; 
wherein said BSR1 (301) further includes a third BSR bit (341) and a fourth 
BSR bit (343) which are set when said first PE (200) reaches a third and a fourth 
barrier point, respectively; 

15 wherein said first PE (200) further comprises a second BSR output (401); 

wherein said coupler comprises a time multiplexor (TM), wherein said TM 
time-multiplexes said first BSR bit (340) and said second BSR bit (342) on said first 
BSR output (401), and wherein said TM time-multiplexes said third BSR bit (341) 
and said fourth BSR bit (343) on said second BSR output (401); and 

20 wherein one of the fanin inputs of said third bypass point (302) is coupled to 

said second BSR output (401), wherein one of the fanin inputs of said fourth bypass 
point (310) is coupled to the fanin output of said third bypass point (302), wherein 
the fanout input of said fourth bypass point (310) is coupled to the fanin output of 
said fourth bypass point (310), wherein the fanout input of said third bypass point 

25 (302) is coupled to the fanout output of said fourth bypass point (310), wherein the 
fanout output of said third bypass point (302) is coupled to said first PE (200). 

5. A computer system according to claim 2, wherein the fanout output of said 
first bypass point (302) generates an interrupt to said first PE (200). 
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6. A computer system according to claim 2, wherein the fanout output of said 
first bypass point (302) controls a value which is polled by said fust PE (200). 

7. A method for barrier synchronization on a computer system, said computer 
5 system having a plurality of processing elements (PEs) (200) including a first PE 

(200), and a physical barrier synchronization circuit (402) including a plurality of 
bypass points (302, 310) each having a fanin gate (330) and a fanout circuit (331), 
wherein said bypass points are arranged in a tree having a plurality of levels, each of 
said plurality of PEs (200) including a barrier synchronization register one (BSR1) 
10 (301) and a first BSR output (401), each said BSR1 (301) having a first BSR bit 
(340) set when its associated PE (200) reaches a first barrier point, the method 
comprising the steps of: 

outputting a value representative of the first BSR bit (340) on the first BSR 

output (401) of each of said PEs 200; 
15 fanning-in a value from the first BSR outputs (401) from a first plurality of 

said PEs (200) to generate a first barrier-completion signal (403); 

selectively bypassing said first barrier completion signal (403) at one of the 
plurality of levels from the fanin gate (330) to the fanout circuit (331) in order to 
partition the tree; and 

20 fanning-out said first barrier completion signal (403) to each of said first 

plurality of PEs (200). 

8. A method according to claim 8, wherein each said BSR1 (301) further 
includes a second BSR bit (342) set when its associated PE (200) reaches a second 
25 barrier point, wherein: 

the step of outputting comprises the step of time multiplexing said first BSR 
bit (340) and said second BSR bit (342). 



30 



9. A method according to claim 8, wherein each said BSR1 (301) further 
includes a second (342), third (341), and fourth BSR bit (343) set when its associated 
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PE (200) reaches a second, third and fourth barrier point, respectively, and a second 
BSR output (401), wherein the step of outputting comprises the steps of: 

time multiplexing said first BSR bit (340) and said second BSR bit (342) to 
said first BSR output (401); and 
5 time multiplexing said third BSR bit (341) and said fourth BSR bit (343) to 

said second BSR output (401). 



10. A system for processor barrier synchronization of a plurality of processing 
elements (PEs) (200) in a distributed processing computer (100), comprising: 
10 a barrier detection program means operative in each PE (200) to determine 

. when each individual PE (200) has reached a processor barrier; 

a plurality of boolean barrier synchronization register (BSR) outputs (401) 
from each PE (200), each output (401) controlled by the barrier detection program 
means, indicating when a processor barrier was reached by that PE (200); 
IS a plurality of physical barrier synchronization circuits (402) for detecting 

which PEs (200) have reached a particular processor barrier, comprising: 

a plurality of boolean logical AND gates (330) connected to 
corresponding BSR outputs (401) of the plurality of PEs (200) to 
indicate when all of the PEs (200) reach a processor barrier; 
20 an ALLDONE signal output (403) from the plurality of AND 

gates for providing an output indicating that all of the PEs (200) have 
reached a particular processor barrier, and 

a plurality of logical fanout devices (331), with inputs 
connected to the ALLDONE signal line and outputs connected to the 
25 multiplicity of PEs (200) to indicate to every PE (200) in the system 

that the particular processor barrier was reached; and 
a plurality of barrier synchronization register bits (340, 341) for storing the 
status of each barrier synchronization process. 
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TURNING EUREKA STEPS INTO CALCULATIONS IN AUTOMATIC 
PROGRAM SYNTHESIS ' 

A Bundy, A Small] Mid J Hesketh 
University of Edinburgh, UK 



Abttrect 



W« daacrlba a technique called wuddlt-4ui natem'st for 
the comtroi of March fa eatomatk theorem proving. Wo 
Ufeatrata tta eat to tht domela of eatomatk program ■ya- 
thetk. Programs caa bo lyntbctieed from proofr that their 
kgicaltpeclAcaUoataraiatkftabla, Each proof ttep k ako 
a program cooatrtctka *Up. Uafortaaataly, a aarra soa 
of thb techaiqat reqelrco a tanas or computer to pro- 
duct proof §ttp* whkh provide tha otatatlal •tracturo of 
tha dcelrcd program. II b hard to eet the juttlacetfea for 
theoc rtapt at tat time thai they art made; the roam 
for thorn emergeo emly later hi the prool Sacs proof ttopo 
art oftoa call 'eureka' ttopo. Middle-out rot ton fog eaabko 
thoto oartha ttapo to bo prodaced, aetomatkaUy, at a tldo 
of act of aoa-caren. ttcpe. 



1 INTRODUCTION 

Wt daKribo a tachnique calltd wuddlt-rvt riereninf for 
tht control of ttarch daring automatic program cyatheeb. 
Compottr program ryatheak It coadactod by oar Oyttar 
program. Bora |1], which wat baatd oa tht Coraett UaV 
▼arttty, Naprl eyttam, CoaatabJo ft ai (J). To tyatkeoka 
a program oao providte a logical tpecucatka dttcrfbmg 
a rakUoe, tptc(fapair a oaipaf) ( batwoaa tht lapute aad 
oatpati of tha propoeed program. Oyttar It thaa atod at 
aa tatmctlve theorem provcr to prova a coajoetaro of tht 
form: 

VinpmU Bomtpmt tpu(mfU t output) (l) 

la a logk baatd oa Mertto-Lftf Iataltloakt Typo Thaory*, 
laartto-Lftf (*]. Tok logk b cotutnutit, to. tha proof 
that aa eatpsf cxkte for aay combination of tmjmtt matt 
abo thow how to ctmwacf tha output from tha input*. 
Tab oaattractloa raclpt caa bo extracted from tha proof 
aad taraod la to a compatar program. 

Oy ttar procotdt by a proems of backwardt roaaoalag from 
tht goal. Ataodatod with aach of IU backward* proof 



•Tba rwaarcb rtporftad la thb aapar wat tappetied by 
SZRC (rut OR/B/44IM aad aa 1ERC Saaior rtUewaMp to 
thomat author. Wa art grattfaf for dheattbat with tha otbar 
■wmbiH of tha am thwn o t k a l rwoo ntog troop at aVttnbarab. 

»Bowtw, to tha iotareste of widar readability wt hava 
tranrfetod tha haartto-Lof tangaef tato o mow coarantioBal 
aotatioa in tha example* frr«o balow. 



lUpt b a program contraction ttep. Ai tht proof pro- 
cotdt a faactiooal/logk program, prot(tapaU), whkh b 
abo oxprtoatd la tht Typo thaory logk, b coaatrectad. 
Thb program maott tht cpeclscatloB: 

Viayaie. $p*c{i*puU,profii*puU)) 

Thtra b a daallty bttwota proof tttpt aad program coa- 
ttractloa tttpt, «.y. aa todacUva proof prodacat a ra- 
coraWa program. Sine* rtcartlos b parratffo la font* 
tkaal/logk program* wt art partkalarly tottreated la 
proof* by mathematical todeetiem. Sach proof* prtttnt 
atpadally dimes* problamt of ttarch coatrol, aamcjy to 
tht choice of hdactkm rah aad todaeUoa veri*bk(t). 
To eatomata thb procaat of program tyathotb wt hart 
ballt a program Clara, vea Htrmoba which goidtt 
Oyttar Is Ite ttarch for a prool Clam coataiat a celiac. 
Uoa of UttUA. Tatet art compatar progremt which apply 
Oyttar relet, aad haaca direct Ita ttarch. Clam aaalyait 
tha coajeetare to ba proved aad aeta AI piaaatog tech- 
aiqaet to coattroct a ptm/ eta, which b a tactk atpa- 
cially detlgaed for tht carrtat coaiectart. Farther dataik 
of the Oytter-Ckm tytttm caa be band to Baady if ai 

14 

In OytterH fcgk, a coajoetaro of form (1) k otaaUy proved 
by eliminating ha qeuatifian at aome etage, to form a goal 
of the form: 

$pte(imput$, ptop{inpuU )) (1) 

where troy {rajraU) b eaUad tht witruu of the cmtteatlaUy 
qaaaUfied variabb safest . Thb rather defeate tht object 
of the txerclea At caa be etea la goal (J), U b otctttary 
for the Oyttar atef to prtwid* the very program whkh 
Oyttar b tappoted to be ryathaalatog. Brta If the proof 
b tret divided la to cacct aad the qaantlfiare ellmtoated 
atpaxataty la aach cate, at to J 8, the combiaattoa of tht 
etparato wkaeetee ttiU dt&aet tht program. 
The root of the proof coattltatea a mere vtrifkricn that 
thb program meete the epedlcatlos, rather thaa a «yn» 
iktwu of tha program. Thb afimiaatloa of aa exkteatial 
Cioaatifltr b as eaample of a •arrieattp, i.t> a tttp whott 
Jojtlleatloa k act appareat at the time It b made. It be- 
come* tvid tat later to the proof, of count, whea the veri- 
ficattoa taceaedt. Bartka ettpt preeeat a probbm for both 
hamaa aad compatar theorem provtog. la either cate It 
b hard to tee how they might be thought of, t.f. what 
tactk might be Impkmeatad to make eareka tttp*. 
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Another txnopb of a eureka atop It the choke of appro- 
pxUu Ududioa variabbaawi mductioa rait of mfcreacu. 
Theee will determine the forme of recareloa to b« need by 
tho program- 



3 PROGRAM SYNTHESIS BY THEOREM 
PROVING 



To IBurtimU Iks tacknkiue of program ayiUnb end the 
tnrekn itepe H requlree, we will ehow how a program, 
/acttft(a), far f a ctorbJag » pool the integer, e, Into iU 
prime factort, caaboeya thoab o d from ft proof of tho prima 
factorinetloa theorem. Tho primo mctorisatioa theorem 
cam b« expTOOoed la EogUth ok 

Tor aU poeitire lategero, », there ba tfet of primo 
nambera, nl» ouch that a eouab the product of tho 
nambere la af.* 

It cob bo ropreeanted la oar Type Theory oa the formula: 

Ve:p»atai 3sf:ifor(prim«) prod(*0 a o (3) 

where X : T mease Xtu object of typo T. porta* Jo 
ta* type of poeMre enteral aembere, prim* b the typo of 
primo iwbtn, aad Ji*f{r) b the typo of tbte of objects 
oftyper.f.0. of : list^rim*) mui si b o Urt of prime*. 
Too function proa* t firffrotfat) — aorta* takao o Cot of 
oimboro aad rotano their prod act, L*. 

prod(«iO - I 
prooXU :: «) - kd x fr*(tl) 

weare nil t* tbo empty Ibt oad :: b the Infix U»t const roc- 
tor. 

We cu prow tab tbortin eerily by ualnj tbo prtmex 
Induction rule: 

P(l) p;pWmo«, a'tpoKaf Pfy ) H# >>(» * «*) 
' K, Va: portal. P(a) 

£.«. «• prow the theorem for a «• 1, then wo oooamo 
it for s d a* aad pmo K for a • p x a', wboro p b a 
primo aambor. Tao greek kttere mbelBag tba H eymbob 
ure tao program fragmeate oaooclaUd with tbo proofr of 
tboM eeqetatr. Tbo program coaetructtoa etep areoclated 
with tab h diction rule dofiaat tbo «y program la term* of 
tbo a and fi program fragment*, ao follow* 

7(a) S if * - 1 tboa a 

obo fi[p, a 1 ) wboro p - /*>ee(a) Ai'o r **t(s) 

wboro /ir*t(e) b tbo amaOart prima aambor tbat divides a 
aad re*C(e) b tbo quotient wbaa a b divided by /w«f (a), 
la tbb caoo /«cf#ri(a) - 7(a). 80 tbo decitloa to use 
prinux lad actio* aaamroa tbat tbo program wJD aao a 
dual term of recnrtloa. Deciding lo aao tbb Motoric form 
of indictloa b omr trot aaroka atop. 

Tbo baoo caat of tbo mductioa bx 

h. 3xl:tut(mrim4)prod{*t)m 1 (4) 



aad tbo f Up caoo Ik 

p: primo, (») 
a'ipoMaf, 

3af :lifl(pronw) prod(af) - a' ho 
3al:iuf(primo) prod(sl) - p x a' 

Tbo baoo caoo b roadlly provod by aUmtaatlag tba ucktaa- 
tial qaaatlflor m (4) aad iatrodacmg tbo witaw* nil for af. 
Tbk ako laotaatlatoa or to oil. 8baUorhjr v tbo ftop caao k 
proTad by trot oOmmatlag tbo oxktoatlal qaaatltar la tbt 
iadacUoa bypotboab part of (5) wttb wltaaw af aad tb«a 
tba oaa la tbo ladactloa coadaa lo a wltb wit&aot p :: af. 
Oyotar b abb to work oat tbat af b /odor #(»'), ao tt 
iaotaatiatoo 0 to f ftcUr $(*'). Tbt program b now: 

/ocfor*(a) m 

if a* I tboaaif 
obo p ^ /oetor«(aO 

wboro p - /trof(a) Ai'» r«jl(a) 

Tbo program b aow faUy lyathatlood. 

Tbo wHbom latrodacod by tbo aflmaaUoa of tbo oxbtoa- 
tUl qaaatltar la tbo ladmcttoa bypotkoob b otaadard. bat 
tbo witaooMO latrodacod by tba attmuaUom of tbo other 
two qaaaUtara coaotltato oaraka atapa. Not* kow tboao 
two wHaoaMf aro tbo valaat of tbo faactloa f to 
tbo baoo aad atop caooo of tbo rotareJoa. 



8 THE MPPLEJDUT TACTIC 

Tbo romoiafag parto of tbo proof wiQ provide a varilca- 
tloa tbat oar cbofcee of induct Soa rob aad oxbtantial wit- 
aeaaea yUld a program tbat m—U tba opaciacatloa, Tba 
romalaiog part of tbo ttip caoo b to proro: 

p: prime, (tt) 
a':po«tat, 
af :lUt{prim*) 

prod(af)«a' h 0 f£I?^f :Kel(pr<me) A 
prod(f7^>f)»[px>' 

Note tbat tbo cxbtoatbl quaatifien bare all boon elimi- 
nated aad tba oxbtoatlal varkbko repUcod by their wit* 
aooooo. Tbo witneoo of tba oxbtoatlal variabb La tbo 
Induction coacmibn, p z af mart bo ebowa to have 
tbo right typo, lief (prime), 00 tho eabgoel, p a af : 
tUt(prim*) t becomee part of tbo udaeUoa coacltteioa. 

Three enb-eapreatione of tbo lad action comclaoloa have 
been placed la boxoa. Tboao aro oxampba of woo ftvnU. 
Wan fnnU aro tboao «ib-exproarioaa of tba bdnctlon 
coacmaloa la whkb U dlfftn from tbo bdnction bypotb- 
eeb. Tbo Ckm eyotam coaUlne a tactic called r\ppl*j>%t 
wkooo tank b to make tbo Ind action bypotboab appear ao 
o eubwexpreeiioa of the Induction coadaaloa. It worka by 
rewriting tbo Indactloa coactooioa to more tbo ware front* 
outward* from their original deeply netted pooHloaa. Tbo 
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rote* eacd by Hf*i*s+l era ceUtd mm rata. Wivi ratee 
ut rewrite mitt of tea form*: 

♦ [njw 

wbero * T ud tbt St are tow witb dbtlugukbed arge- 
m«aU T may be empty, bmt F and tbt &V nut tot 
bt. Tbt ft kit obf wtrt fronte udf btk« now wuvt 
Croat. AppBcetioa of a wave rate rlpptet mm wart fronte 
oat by out 1U0. Repeated applkatioa rfpptee titm aU 
to Hit outeide of tbt ladectloa coaclution. 

Tba wete rate* r^ftind for rate proof are: 

\ki^ i.H$t(t 9 pt) * iMteyytA^ tl^trH (T) 

pr#d flM::| «} * ^r^tf) (0) 

ftx"> * 9 mm (0) 

ApplyUi tbeet wave nlM to tae Induction concteelon of 
(6) rtwrlttt tbt ttep em at WWi. Afttr application of 
rab 7 to tke firet wave front, dim dtrivta: 

p: prime, 
a':n**tnf, 
e/:Juf(*rim«) 

pre<(ef)-«' ^ | f r pWmeA ^ :<u^Hpm) A 

After application of ntW • to tbt eacoed wm Croat, It 
derivea: 

p:yWmc, 

tt xli*t{frimt) 

prtd(eJ')-a' r- # | f : pHmeA p / : li»t{prim«) A 

And afttr application of rait 9, elmultaneoaety, to tbt toe* 
oad aad laird wan front*. It derive*: 

prime, 
n't netful, 
*/:K#t(pwm4) 

prtdUO « a' {jj : prtmtA | i/:Kel(pTtm«) A 
prtd(«n««' 

Fignree 1 aad S Ulottratt tbte ripplteg-oat proetat grepb- 
fceUy by ebowteg two of tba •ttgta tbat tbt Induction 
conctecion coat tbroagb. 

After tbttt tbroa rate appBcatfcej cacb of tbt tbrtt con- 
junct* of tbt tedection conclutfoa W identical to oat of 
tbe fad actios nypotbente. Tbt Induction bypotboab can 
then be oied to prove tbt lad action conciliate a. TheCkm 

«TW» fa aot tbt moat gantiai tens tbat wavt ratet can tab* 
bat m ■ufllcteat far tbt examptee In tbte paptr. A n> 
form fa ctvta b Bundy at (e| 



tyttcm bat a tacik caDed ftrtUiamtf** wbott taab te to 
matcb tbt Induction bypoibaate egatett rtb-oprattiooj 
of tbt tedmctloa concbiatea and replace nay racb eub- 
txprtaalont by (rat. Doteg tbte complete* tbt proof of 
tba tup cant In Ufa example. 

A fuller exptennUoa of rtpptejraf can bo found mfe). 

« MIDDIX-OUT REASONING 

Wt can tblnk abont tba tartka etapa doncrlbtd In (1 In 
tbt foUowtef way, Tbt cbokat of tedacttea tcbtmt tnd 
txteteattel witattttt art actaaUy mndt In racb a way at 
to anew tbt ran of tbt proof to procatd tttoaaafalfr. In 
partkuter, tbt tartka cboteat wQ2 tnnbte tbt tnbttqatat 
ripptt^ut tactic to eeccead. HbwtTtr, atnct tbt tartka 
etepe art meda btfora riypUjvi it applied, tbte It aot 



Tbte obttrvatten raffttte a way to aa tomato tbt tartka 
dtctetena, aamtly: poatpono maUan: tbt tartka tttpt and 
apply r(fpl4jy%t fir§H tbat tetafrate tbt tartka tttpt iato 
tbt ripp&nf at rtqalrtd to tnnbte It to conUaat, Sff to 
thraly wodotbanUdteoftboproofint. te tat procttt 
wt cnkaiaU wbat form tbt baatanJaf of tbt proof tboabf 
tab* to nana tba mlddU of it tmcctad. Wt call tbte ttrat- 
tgy aWdaVt-tal rcu*«af 



4.1 Rippling Under ExfatentlnJ Qanatlflcrn 

W# coaaldtr ftrtt tbt probtem of cbootlnf witnttttt for 
txtetanUany ojaaatlntd ▼arlabtea, Oar notation It not to 
tlimlmU tbt axtttentlal qaaatlfiata at all, bat to modify 
tbt rtwritlnf proctdara to tbat Hppllaf can bt canted 
oat andtr aaJattnUal quaatiiara. lb Oteitrnta tbk pro- 
etat, cosaldtr agate tba proof of tbt print (actoriaaUoa 
tbtortm. Wt rttarn to tbt attp cant of tbt proof, jatt 
after tbt appUcatioa of Induction. 



prprtma, 
n f ;n9#tiU t 



(10) 



3a/:lu<(»TmM)arttXar)- t' h $ 

fn}:list{prvru) prod® = f>T> ' 

Tbte Umt wo bntt marktd in tbt wavo fronte te Jart tUav 
teattef tba tntetentlal quantifier. Rtcntt tbat tbt wltam 
for il In tbt tedactteo concluloa, p s sf, contnintd a 
waYt front, [pTT^ (ton (6)). Tbai Juitiflta ov markbg 
tbte axitttntlaTvartebte an a wart front. . 

Tbte ramark ftatratbtt to all aaJattntiaQy qaantlftad van* 
abtet In induction conclation. Tbate witnttttt will bo tba 
valat of tba tyntbtaUtd profram In tba tttp cant of tbt 
rtenraten, it. tomt fa action of tba vaiaa of tba program 
wban It it caDtd racartWety. Wbtn tbte tancUon It not 
tbt Identity fnnctlon, tban H wffl bo a wata front So aU 
axtettattel mtebtet In induction conduateni art potential 
wava fronte. Note tbat tbt occarronct of tba axtettntlni 
vartebte in tbt qoanUlar dtdaratbn It marktd at a wart 
front, at wall at nil occurrtnett In tba body of tba formate, 
Tbte it becautt tbt quantifier declaration nlto contain* a 
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typ* d«cbiatloa, wbkb win roqolrs rippibf witb w»t« 
ralss for typos, Bks (T) obov*. 

Ws now procsod to apply rtppfcoaf, atlas; tbo umi 
war. rubs u U {S, btt wttboat first rsmovbg too 3xJ : 
ft* f (prims). Omr modJaod rewriting procodaro k*i too 
froodom to lastaatbte oxbtentlnl vorkhbo to oompommd 
Unn* of th« sams typo. Tab it bocnaos to tho aamodHod 
procedore Omm coonpoaad term* coald aavo boom Intro- 
duced os witness** before rif?itjt%t wu applbd. To tens 
edvnatoge of tab possibility too modlnod rewrtUag pro- 
csdars matcbet too loll load o$do of too rewrite rmlo wlU 
sxpreuba* io too gosi, treating enbtentbl vsrbbbe m 
free verbbbo tbat coo bo ia*teanbtsd. Weee so exbtee- 
tbl rwUbU It iastaatbted to o wttaott of too same typo, 
tbo exbtsatbl qeontmer tint covers* It wU mo leagsr bo 
reqalred, sad sboaU bo dsbted. However, o oow exbtsa- 
tloJ qmeatlaer will bo rsqnlred for eecb vnrinbb eeatemed 
io tbo wttasss. lb sssbt wjtb cbeckbg tkot tbo witness 
boo tbo toqmirod typo ood wit* tbo cafcebtioa of tbo types 
of tbo atw sxbtsntisi vnrbMss, it kelp* if tbo wort front* 
io tbo qaantlfi tr/type dtdoroUooo on rippled belbro tboot 
lo tbo body of tbo fon&abo. 

Coootdor, for hstaace, tbo appBcatloa of won rait (7) to 
tbo flrtt wove froot lo (10). 

| Skd.frim, \ 3tl:li*t(prim€) prod rfoTT^ f) = [px}' 

Tbit insteatbtet mi, ood booco 0, to od :: if. Tbo exbtea- 
Usl qeantlitr governing ■ ropbced by two oxbtentlnl 
eaoatiisr*: ooo fee ad ood ooo for tf. Koto bow tho typot 
of tboot mtw enbtentUl varbbbs ere provided by tbt rigbt 
bead side of rob (T). Tbit moo of o typo rate ebo gear- 
oatoot tbat tbt witaess boo tbo rtgbt typo. Note ibo bow 
tbt otcoad occarreace of of, la tbo body of tbo formal*, 
boo booa bsteatbted to kd :: tl. 

Tbo witness to tbo eabteatbl verbals, *t t boo oow bota 
dotormbed. It woo aot oocoooary to do tab uim> 
nam stop. Tbt eppreprbte witaoot woo cakulotod by tbo 
matcebg roatlao dsring tbt oppQcaUom of Wypftjpat. Ho 
ssercbwisieqeJrodfortbb. Tbo wavo front sroend af bod 
to bo rippkd oat ot totao tlmo (io feet, It It bott lo rlppb 
it oat ftrot). Rob (T) woo tbo only moteblaf wovo rob 
bocoBM of tbo restriction Impoood by tbo sob-exprsssloa 
iMfCpWow^tbotypoof tl 

Wo oow opply wovo rab (•} to tbo oocoad wav* frost, 
yielding: 

|3M:pr»ms | 3</:iitl(pnmt) \kZx\todjtl) « [px}' 

Tbb oppDcotioa dooo aot mstnatbte oay oxbtentlnl vsri- 
ebbe, oo doe* aot reqalre oay cbanges In tbo oabtoatiol 
quantifiers. 

Ws can now opply melti-wave rab (9), timeltaaeoesly, to 
tbo oocoad oad fbbd wave fronts, ybldlag: 

dUilutipHmt) m a' 

Tbb rewrite tatteatbtes ad to p. Tab iaotaatbtloa b 
only poaalbb bocaaoo p b a term wttb tbo tamo typo to 
M. 8 lac. bd bat booa totUatbtod Ui oxbtoaUol qatatW 
Oar mail bo romortd. 8hco tit coattoat, oad tboroforo 



coatob m ao aow varbbba, ao aow oxbtoaUal qaaatiftoto 
tro mtfodacod. Tbb boo tbo oldo tfoct of romorbg tbo 
trot wovo front. Tbo comalaiag formalo ataoabai tbo bv 
doctloo bypotboob. 8o brUUmtloa romoooo R, laotant^ 
otlag II to mt , oad boaeo of oad fi to p n a/, oo roqabod. 

A tmt&ar procoot coa bo carrbd oat la tbo boot eooo. Tbo 
rob of tifflt**t bolaf pbyod by tbo Clam toctk, lot*, 
wbkb rowrftot fbmalao aobf tbo boot cooot of rocartlvo 
dotattbaa, 



4.2 Ualng M«U-Variablea for Indue Uon 
Tormo 

Wo aow coaoidor tbo problom of cbooolof ladactloo robo. 
Oar oobtbo b moot o 'boot coaualtmoaf induct bm fUp 
by aoing a ocbamatk ladoctloa rob la wblcb tbo Indac* 
tioa term b o BMto-vorbbb. Wo tboa allow tbb mota- 
Tarbbb to bocomo lattontbtod daring tbo tabooqaoat rip- 
pling oat. 8 toe* motn-vorUbbt oro aot oUowod la Oyst«r*t 
logic, tbb roaooataf b baadbd witbb tbo Cbm pbaaor. 
Cbm opplbo rtpafo jnU oad works oat wbat iadoctba rob 
b roqabod. It tboa la* tracts Oyttor to apply tbo appro- 
prloto ladactloa rab oad tboa rippbt tbo wavo front* that 
ladactloo croatoo. 

To Uhutrato tbb wo rotarn ogola to tbo ttop coot of tbo 
prim* bctorisatloa proof, jot! after tbo oppUcatba of av 
dactloa, it. 

s':po#iaC, 

3af ;/i#r(»Ttnu) prod(aO - «' *-# 

3©fisf(pTtmt) pr*<<® * E 

bat boteod of tbo ladocUoa term, p x n', wo bavs uod 
tbo aMto«varUbb t X t wbkb ttoad* for oomo term coo- 
tolalag o', i*. wo bars ropbcod oU oecarroocos of s', in 
tbo ladoctloa coadaoloa, by X. All aahroroaDy qaaatlaod 
vorbbbs oro caadldoto ladactioa varbbbs, bat o b tbo 
only cand idate La tbb oaampb. In goooral, wt mast try 
rspbdag oacb aaWorsol varbblo, la tarn, by oaeb n mstn- 
varbbb, to too wbkb roplacsmoats pormlt rippUjp%t to 
soccood. Note tbat tbo docbraUoa of p bat booa omlt- 
tod from tbo bypotbosb, sines wo do aot ytt know wbat 
paramotert might bs latrodactd by tbt Induction. 

Rippung-oot tboa procsods so la {4.1 aatU wo gs* to 4 bs 
formab: 

1 3U:prwts \ 3Uili*t[primt) \kdx \ rod{U) » \x\ 

Now rab [9) b tbo only wavo rab tbat sppUos to tbo 
socoad wnvo front. Applying it prodacts: 

1 3ns-; prims \ 3tl:tutjprinu) pvod(tf) « X 4 

wbtro X b lasUntbtod to ad x X. f$t tViMttum tppttos 
to tbb, Inotaatbtiag X* to o* oad bavlag tbt rsaidat: 

3Ad:prtm< fro* 

wbkb b provabb provktsd Ad b witassssd by a prims 
nambor. 
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Wt Hava thus MUblkkad that Ik* rippUj>%4 tactic will 
•uceted provided thai the todies* term iiMxi', where 
M: prime. Induction ralaa art Indexed la Clam by their 
induction terms, ao this Information enablee the systems 
to recover the or una x induction rale and nee It retro- 
ipectlvely. Notice how the appropriate induction rale wm 
choaan m a side-effect of the riffU-o*t tactic 



5 CONCLUSION 

la thie papar we have described a technique, baaed on 
theorem proving, for frnthetisipg proframt from logical 
t pacification!, Wa have eeen that a a aire nee of thie tech- 
nique reqairee eureka etepe. In fa we aaw that II waa 
aeceeaary to produce an appropriate induction rale and 
appropriate existential wltneeaea 'oat of the bine*. Fur- 
thermore, theee eureka atepe provided ail the eaaantial 
structure of the eyniheeieed program, leaving only the ver- 
ification that this program met the specification. Theee 
eareka etepe constitute a barrier to the nee of the pro- 
gram eyntheeii technique. Thia la true whether we want 
to nee it totally aut oma t icall y, with a computer provUU 
i>! the eureka stepe, or aemi-sutomaticalry, with a human 
providing them. 

We have deacribed a way of ftneneing the eureka etepe, ao 
that program tynthesle can be automated. We draw on 
previous work In which we automated the verification part 
of the proof, In particular, our ripoUjmt tactic U highly 
•ucceaiful In automatically guiding the proving of the in- 
duction conclusion from the Induction hypothesis. We 
have arranged the proof construction ao that this middle 
part of the proof ie done & ret. The eureka etepe emerge as 
a aide effect of rippte^ut — they are made In each a way 
as will permit rippfcovJ to continue. We call this tech- 
nique middli'OHt rtosoning. We are currently implement- 
ing middk-out reasoning within the Oyster-Clam system. 
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Figure 1: Induction CWlutkm After One Ripple 
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Abstract 

A common scheme for preventing deadlock in networks is the virtual channel method of 
Dally and Seitz (DS87). Due to the nature of this scheme, an otherwise completely uniform 
network will have non-uniformities introduced into it. The variations introduce several effects, 
ranging from limitations on overall network performance to differences in observed network 
characteristics from node to node and from message to message. 



1 Introduction 

A useful multicomputer network must be both efficient and reliable. A key component of 
reliability is freedom from deadlock. Many schemes have been introduced which prevent deadlock 
in general networks, at varying costs in terms of additional buffers and complexity. A particularly 
interesting scheme is presented by Dally and Seitz [DS87] which relies on virtual channels to 
break dependency cycles in a network. The general scheme is to note where cycles may exist 
in a network and routing scheme and then to add virtual channels in order to transform the 
circular dependencies into spirals which are, by nature, acyclic. 

While open-ended Jfc-ary d-cubes {e.g. meshes and hypercubes) can be made deadlock-free 
by restricting routing to be oblivious and dimension ordered, this cannot be extended to fc- 
ary d-cubes where the edges are "wrapped around" (e.g. tori). A scheme presented in [DS87] 
achieves deadlock freedom in wrapped-around networks by dividing each physical channel into 
two virtual channels and restricting the routing to be dimension ordered. A further restriction 
prevents deadlock: a message uses the "high" set of virtual channels if the number of the 
destination address is higher than the current node, and the "low" set of virtual channels if the 
destination address is less than or equal to the current node address. This prevents deadlock 
by choosing a terminal node at which dependencies in a particular set of virtual channels must 
terminate, thus preventing cyclic dependencies from occurring. 

To achieve maximum performance in a particular network, each of the resources in the 
network should be loaded equally. If one component reaches saturation before the rest, the 
network will tend to slow to the load which can be handled by the saturated component. 1 In a 
sense, this node has become the "weakest link" in a chain, and the strength of the entire chain is 
thus diminished. This, for example, is responsible for early saturation in mesh networks where 
there is extra congestion in the center of the network. A torus network exhibits vertex transitivity 
which means that there are no observable differences between any two nodes in the network. 
Thus, torus networks should not have the problem of congestion in the center because there is 

^his is true for uniform random traffic, but with itou- uniform traffic local saturations may not spread to cover 
the entire network. 
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no distinction between nodes in the center and nodes along the edges of a torus. However, when 
the virtual channel deadlock scheme is implemented, the routing mechanism is perturbed in a 
manner which re-introduces non-uniformities by restricting which messages may use particular 
buffers. 

2 Uneven buffer utilization 

Although the virtual channel scheme successfully prevents deadlock, it does so by the addition 
of underutilized buffers. Furthermore, the effective buffer space available to each node varies 
according to the node's position in the network, creating non- uniformities in the network. 

Consider the set P of all paths between any two nodes allowed by the network and routing 
algorithm. Each channel c in the network is part of a set of paths p c which use c. Also, each 
virtual channel c v of channel c has a set of paths p c „ which use c„. We know that |p c J < \p c \ 
since no virtual channel can have a greater number of paths through it than the channel to which 
it is attached. Now define the virtual channel utilization u«. = for each virtual channel. 
(£ Uc9 - l) Thus, tie. is the fraction of the paths through channel c which use virtual channel 
c v . 

Now, since each channel is represented by multiple virtual channels, the most effective use 
of the buffer space provided by the virtual channels is for each virtual channel to have the same 
fraction of the total traffic. If one virtual channel has more traffic than the others, it becomes 
the bottleneck for the channel. Thus we define the effective buffer size B = may J (Ue ) - If all 
the buffers are utilized equally, B will equal the number of buffers; if only one buffer is'utilized, 
B will equal 1. 

2.1 Unidirectional channels 

Consider a unidirectional ring network under random traffic. This can be thought of as a "slice" 
of a torus, and since, in dimension-order routing, interactions between dimensions ^are minimal, 
the analysis extends to multi-dimensional tori as well. For channel c, |p c | = C-o *• 

The scheme proposed in [DS87] routes messages on the high virtual channels if the node 
number of the source node is less than the destination node, and the low virtual channels 
otherwise. This results in the "terminal" node being node zero. The number of paths through 
the high virtual channels is \p c „\ = £{ s0 i* for node j. The number of paths through the low 
virtual channels is |p c J = Eily+i Tllus < for a unidirectional ring, the effective buffer size is: 
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The scheme implemented in designing the torus routing chip [DS86] injects all messages 
(except from node zero) on the low virtual channels. Messages are routed on the low channels 
until they cross node zero, when they switch to the high virtual channels and continue on them 
until they reach their destination. This also results in the terminal node being node zero. The 
number of paths through the high virtual channels is \p CH \ = * for node h The number 

of paths through the low virtual channels is \p CL | = V *■ Tllus * for a unidirectional ring, 
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Table 1: Effective buffer size for a 16-node unidirectional ring. 
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The number of paths through each set of virtual channels, as well as effective buffer size, are 
shown in Table 1. 

2.2 Bidirectional channels 

For a bidirectional ring, the number of paths through any channel is different because of the 
shorter path lengths. For channel c, \p c \ = X-i=o l - 

For the "dest < current" scheme, the number of paths through the high virtual channels 
is \p CH \ = ICisO*" 0 ' 2 * for node j. The number of paths through the low virtual channels is 
\Pc L \ = IZ|-= / /-(iv-i)/2+i *- 3 Thus. for a bidirectional ring, the effective buffer size is: 

= — p^-^ — r 

max YL £ 0 



We define the sum where the lower bound is negative to be the sum from a lower bound of zero. 
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Table 2: Effective buffer size for a 16- node bidirectional ring. 

For the "crossing node zero" scheme, the number of paths through the high virtual channels 
is \p Cff \ = ' for n0(le TIie number of P aths throug" fcne low visual channels is 

\Pc L \ = 5Zil / a~ i Tnus - for a bidirectional ring, the effective buffer size is: 




The number of paths through each set of virtual channels, as well as effective buffer size, are 
shown in Table 2. 

3 Significance 

The effect of uneven buffer utilization on performance varies with the amount of buffering in the 
network. If wormhole routing is implemented with minimal (i.e., single flit) buffering per virtual 
channel, then performance will not be effected tremendously because the buffers are not playing 
a critical role in achieving performance. On the other hand, increasing buffering in a network 
can dramatically boost its performance. Thus, it makes sense to use all of the buffer space 
available. We see that, in a unidirectional network, the nodes on either side of the "terminal" 
node have only one usable buffer. In a bidirectional network, this is true for fully one-half of 
the nodes. Since these nodes become 'weakest link" nodes, the network will perform roughly 
the same as a network which has only one buffer per channel, despite the presence of twice as 
much buffering capability. 
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Moreover, the amount of effective buffer space at each node varies from node to node. This 
results in a very uneven distribution of usable resources within a network and has several negative 
effects on performance. First of all, the latency which a message experiences depends on which 
portion of the network it travels through, even under a uniform random load. Ideally, a network 
will provide uniform performance under a uniform load so that users will not have to worry about 
being dealt "bad" nodes. Secondly, nodes which have more effective buffer space will have a 
greater number of opportunities to inject messages into the network, potentially preventing 
other nodes from injecting messages and resulting in starvation for those nodes. These effects 
combined to reduce the expected performance of the oblivious torus router described in [BS92]. 

A third non-uniformity exists which may present different views of network contention to 
messages which have different distances between their sources and destinations. For the "Dest 
< Current" scheme, the virtual channel a message uses depends on whether or not the message 
will eventually cross the terminal node (node zero). Consider two messages from the same source 
which are traveling the same direction at some node common to both of their paths from their 
source to their destinations. If one of the messages has a destination which is numbered lower 
than the current node, while the other has a destination which is numbered higher than the 
current node, the two messages will use different virtual channels. Since they use different virtual 
channels, they will compete with different messages for buffer space and thus face different 
levels of congestion. However, from a local point of view, the two messages have arrived on 
the same channel and wish to leave on the same channel, and should not be distinguishable. 
Thus, the deadlock prevention mechanism discriminates between messages which should be 
treated equivalently in a uniform routing strategy. In particular, messages with longer paths 
from source to destination are more likely to cross the terminal node and be routed on the 
high virtual channels than messages which have short paths. Because of this, the contention a 
message faces when competing for buffers differs depending on how far the message must travel 
in the network. 

The same phenomenon exists for the "Crossing node zero" scheme, although the criterion for 
distinguishing between messages is slightly different. In this case, consider two messages which 
have the same destination and are at a node common to both of their paths from their sources 
to their common destination. Again, locally the messages are indistinguishable, as they should, 
from this point on, follow the same path to their destinations. However, if one of the messages 
has crossed the zero node previously and the other has not, the two messages will use different 
virtual channels and, again, face different levels of congestion. Once again, the congestion seen 
by a particular message at a particular node in the network varies depending on how far apart 
the message's source and destination are. 

An especially disturbing result of this phenomenon is seen with the "Dest < Current" scheme. 
Two messages being injected into the network from the same node in the same direction, but 
with different destinations may start out on different virtual channels and face different amounts 
of congestion during injection. Because of this, it may be easier at some nodes to inject messages 
which have either short distances to their destinations of long distances to their destinations, 
depending on the node s location. When some nodes have an easier time injecting their messages 
into the network than other, starvation becomes a problem. Therefore, with this scheme, a 
message's likelihood to be starved varies not only with its location in the network, but with the 
distances that messages it injects into the network need to travel. Fortunately, this problem 
does not occur in the "Crossing node zero" scheme because all messages at any given node are 
injected into the same virtual channel. 



4 Conclusions 



Although the [DS87] scheme provides a simple and efficient method of preventing deadlock 
in torus networks, it unfortunately introduces non- uniformities in otherwise perfectly uniform 
networks. The effect of these non-uniformities is quite severe, ranging from differing observed 
network performance from node to node to a general slowdown of the entire network. Since 
buffering is critical to achieve good performance in any network, it is not possible to eliminate 
or minimize the buffering, and thus the problem. 

One solution is to associate the buffer space in a node with either the entire node (a shared 
buffer pool) or with each physical channel. Thus, all messages would contend for the same 
resources, no matter what virtual channel they were routed on. However, in either case, im- 
plementation becomes considerably more complex because of the need to create two or more 
logically separate buffers within one physical buffer. 

Because this increases the complexity of the router significantly, it becomes worthwhile to 
consider abandoning oblivious routing completely and rely on an adaptive technique to prevent 
deadlock. Non-minimal adaptive routers such as those presented in [NS89, BS92, KS90] prevent 
deadlock while preserving the node-transitivity of torus networks by relying on local rather than 
global schemes for preventing deadlock. At the same time, adaptive routing networks allow more 
flexibility in path selection, making these networks better performers in the presence of uneven 
network utilization. 
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